A Quantitative Approach to Investigating the Hypothesis of Prokaryotic Intron Loss

Author

Robert Sinclair

Abstract

Using a novel method, we show that ordered triplets of motifs usually associated with spliceosomal intron recognition are underrepresented in the protein coding sequence of complete Thermotogae, archaeal and bacterial genomes. The underrepresentation observed does not extend to the noncoding strand, suggesting that the cause of the asymmetry is related to mRNA rather than DNA. Our data do not suggest that the underrepresentation is due to gene transfer from eukaryotes. We speculate that one possible explanation for these observations is that the protein coding sequence of Thermotogae, Archaea and Bacteria was at some time in the past subjected to selection against certain motifs appearing in an order which might initiate splicing in environments harboring a functional spliceosome. This is consistent with, but certainly does not prove, a hypothetical scenario in which at least some prokaryote lineages once possessed a functional spliceosome. Thus, we present a new quantitative method, observations obtained using the method, and a speculative discussion of a possible explanation of the observations.

Introduction

The origin of spliceosomal introns has been a matter of debate for some time. A review of this debate is beyond the scope of this paper, as would be any attempt at justifying our pragmatic decision to use the term “prokaryote” here. Like many others, we focus on introns of the type removed by the major spliceosome, but also consider minor spliceosomal [17] and some self-splicing introns. Our contribution is twofold. First, we introduce a novel, quantitative method of analysis, which is designed to detect traces of current or prior spliceosomal activity in complete genomes or chromosomes. We demonstrate the potential of the method by first applying it to two cryptophyte nucleomorph genomes, one of which has spliceosomal introns, the other of which has lost both spliceosomal introns and the spliceosome itself.
Second, we apply our method to complete Thermotogae, archaeal [24] and bacterial genomes, showing that these do indeed show signs consistent with the hypothesis of prokaryotic intron loss, for spliceosomal introns of both major and minor type. We do not exclude all alternative hypotheses, but do show that horizontal transfer of genes from eukaryotes to prokaryotes, known to be rare, is unlikely to be the explanation for our observations. On the most abstract level, one can view our method as a black box that can, to a certain limited extent, recognize spliceosomal introns with only the Thermotogae, archaeal or bacterial genomes as biological input (see Tables 3, 4 and 5). The fact that this is possible at all suggests that Thermotogae, archaeal and bacterial genomes encode information sufficient to enable such a recognition. Thus, even when our method is divorced from the intron loss scenario which motivated its development, it still provides indirect support for the hypothesis of exposure to an active spliceosome in the coding sequence of complete Thermotogae, archaeal and bacterial genomes. In a broad sense, our data suggest that Thermotogae, Archaea and Bacteria “know something” about introns, and we wonder whether this apparent knowledge could in fact be a memory.

Let us now describe the line of thought which led to our method. The central principle behind our work is this: Intronless genes must, by definition, avoid being spliced whenever an active spliceosome is present. Given that the 5’, branch point and 3’ splice site motifs potentially define an intron sequence, one way in which an intronless gene might avoid the activity of a spliceosome is to refrain from presenting these motifs in the order associated with splicing. Thus, one expects that these motifs would be underrepresented in what we will call the canonical order, with respect to the same motifs in the reverse order.
Since the spliceosome acts only on mRNA, one would only expect to see such an underrepresentation on the coding strand. The idea of using the order of sequence units is of course not new. Gene order [28] continues to be used as a phylogenetic marker for the simple reason that it is expected to be relatively stable over evolutionarily relevant periods. Since we make no attempt to reconstruct any phylogeny, the known practical difficulty of estimating evolutionary distance using gene order alone [29] does not impact our specific use of motif order as a relatively stable marker. The observations that splice site motifs appear to be most strongly conserved in eukaryotes with nearly-complete intron loss, and that short introns appear to be spliced via an “intron definition” mechanism, suggest that splice site motifs may be a suitable set of features to use in the investigation of the prokaryotes’ hypothetical loss of introns. It should also be noted that the idea of RNA selection pressure, which we invoke when we state that we expect no underrepresentation on the non-coding strand, has also been used before in discussions concerning the evolution of splicing fidelity and alternative splicing in eukaryotes.

In the literature, one finds discussions of the hypothesis that prokaryotes may once have possessed active spliceosomes but then lost their introns, via reverse transcription [35] or deletion, followed by the eventual loss of the genes required for construction of the spliceosome, since it would no longer have had any purpose. Such scenarios imply that essentially all of the genes in the genome in question would have been intronless at a time when a spliceosome was still active. Therefore, one might expect that there was a time when prokaryotic genes were under selection to avoid presenting splice site motifs in the canonical order.
With the hypothesized loss of the spliceosome, this selective pressure would have ceased, and the underrepresentation, a direct result of avoidance, would have decayed with time. In this paper, we in fact demonstrate an underrepresentation in completely sequenced genomes of Thermotogae, Archaea and Bacteria. Due to the statistical nature of our method of analysis, we do not expect individual genes to be informative and therefore focus on complete genomes.

It is only natural to ask whether various selective forces, such as codon usage bias or selection for a certain level of GC content, will lead to the loss of memory, at the DNA sequence level, of ancient events like the putative existence of a spliceosome. In order to test the robustness of our results, we have used random codon reassignments, not respecting original genomic GC content, in an attempt to provoke such a loss of memory. The positive result of this experiment, for the full bacterial data set, is presented in Figure 1, suggesting that our analysis is robust enough to capture an ancient signal.

The question of which motifs would be appropriate to use in a search for evidence of prokaryotic genes’ exposure to an active spliceosome is not a simple one, particularly for major spliceosomal introns. Many of the motifs recognized in intron-poor eukaryotes may be derived or secondary. Minor spliceosomal introns do at least appear to possess well-conserved 5’ splice site and branch point sequences wherever they are found in eukaryotic genomes. We take the point of view that both the available prokaryotic genomic data and motifs known from eukaryotes should be taken into account. Thus, we have taken a diverse subset of the known eukaryotic motifs as a starting point, and have let the results of analysis of prokaryotic genomes guide us. From a philosophical point of view, we are therefore in danger of using circular argumentation.
Any photograph of a new species potentially suffers from this problem if the photographer adjusted the focus of the camera in taking it, so one has to expect philosophical problems of this sort when more direct evidence is unavailable, as in our case. Our response is to provide means of falsification and to formulate testable predictions of our approach wherever possible. We do this in the Results and Discussion Section below.

Meaning of the Tables

It will be useful to give a brief, informal description of the method, to allow the reader to understand the tables without reading the detail of the Materials and Methods Section. The basis of the analysis is counting matches to patterns composed of three motifs (one can imagine GT, A and AG here, for the sake of discussion) with variable spacings between them, both in the order they are given and also in the reverse order (which would be AG, A and GT). This counting is performed on the coding sequence of each protein coding gene. If there are more matches to the given order, the gene is considered to show a bias towards the given order. If there are more matches to the reverse order, the gene is considered to show a bias towards the reverse order. A chromosome in which more genes show a bias towards the given order than the reverse order is considered to show a bias towards the given order. A chromosome in which more genes show a bias towards the reverse order than the given order is considered to show avoidance of the given order. The extent of avoidance is given a numerical value by taking ratios of these numbers of chromosomes. The significance of avoidance is quantified in terms of a one-sided P-value, computed with respect to shufflings of nucleotide content within each gene. In the tables, the numbers of chromosomes showing bias towards or avoidance of the given order of motifs are presented as well as the one-sided P-value. A star, indicating significance, appears whenever this P-value is less than 2.5%.
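The gene-to-chromosome voting just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation: the function names `chromosome_value`, `tally` and `avoidance_ratio` are hypothetical, and the per-gene scores are assumed to have been computed already as +1 (bias towards the given order), -1 (bias towards the reverse order) or 0 (tie).

```python
def chromosome_value(gene_scores):
    """Collapse per-gene scores (+1, -1, 0) into one chromosome score.

    Every gene has equal weight: the chromosome takes the sign of the sum.
    """
    total = sum(gene_scores)
    return (total > 0) - (total < 0)


def tally(chromosome_values):
    """Return (G, R): chromosomes biased towards the given order, and
    chromosomes biased towards the reverse order (showing avoidance)."""
    g = sum(1 for v in chromosome_values if v > 0)
    r = sum(1 for v in chromosome_values if v < 0)
    return g, r


def avoidance_ratio(g, r):
    """Fraction of decided chromosomes that avoid the given order.

    Assumes g + r > 0; undecided chromosomes (value 0) are ignored.
    """
    return r / (g + r)
```

Chromosomes with equally many genes of each bias contribute to neither count, which is why the two chromosome columns in a table need not sum to the number of chromosomes analysed.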
In many cases, an analysis has also been performed using the inverse complements (which would be AC, T and CT) of the given motifs. These are marked as “inv.cpl.” in the tables. Matching these constitutes an analysis of the other, non-coding strand. What is important is that we expect to see a lack of significance in these cases.

As described above, this analysis gives equal weight to all genes. The intention has been to be conservative. We do not know for certain if longer genes are more informative than short ones for our purposes. Since it is however natural to ask whether this weighting introduces an unintended bias, we have performed some calculations, using the full bacterial data set, in which genes were weighted according to their coding sequence length. These are described in the Results and Discussion Section. They suggest that our results are not dependent upon the use of equal weighting for all genes. We have also given equal weight to all chromosomes. Since prokaryotic genomes encoded on multiple chromosomes are relatively rare, we have not performed specific tests for any bias. We do believe that this issue is best understood in the broader context of the biased phylogenetic distribution of sequenced prokaryotic genomes.

Genomes

The nucleomorph genomes of the cryptophytes Guillardia theta [39] and Hemiselmis andersenii [21] provide us with a pair of eukaryotic genomes which, although not closely related, are the current best data set for studying complete intron loss. The Guillardia theta nucleomorph possesses a few short (42 to 52nt) AT-rich spliceosomal introns with 5’ and 3’ splice site consensus sequences GTAAGTAT and AG respectively. A branch-point consensus sequence has not been identified, although it is reasonable to assume an adenosine in the intron sequence serves as a branch point, and we note that CTAA is the core of a common branch point motif in intron-poor eukaryotes.
In order that the motifs not be over-specified, which would result in poor statistics, we chose the motifs GTNNGT, TAA, and TAG or CAG as our 5’, branch point and 3’ splice site motifs, respectively. For Guillardia theta, we searched for underrepresentation corresponding to introns with lengths from 42 to 52nt. The nucleomorph of Hemiselmis andersenii possesses neither spliceosomal introns nor a spliceosome. We allowed for the possibility it may have had even shorter introns than the Guillardia theta nucleomorph before it lost them, noting that chlorarachniophyte nucleomorphs do have very short introns, and so searched for underrepresentation corresponding to intron lengths from 37 to 52nt.

It is thought that Thermotogae have been involved in significant horizontal gene transfer. The fact that the overwhelming majority of these transfers have been identified as being within prokaryotes does allow us to treat Thermotogae as prokaryotes, and this is consistent with our aim of quantitatively investigating the hypothesis of prokaryotic intron loss. As a first test of our method, the Thermotogae have the advantage of being in this sense generic and also of being a phylum with a conveniently small number of fully sequenced genomes (eleven) to analyse.

In the case of Thermotogae, we first used the motif triplets GTNNGT / TAA / TAG and GTNNGT / TAA / CAG to investigate possible avoidance of eukaryotic major spliceosomal splice site motifs. GTNNGT is our attempt at a balance between too much specificity, which would reduce the statistical strength of the analysis due to too infrequent matchings, and too little specificity, which would include nucleotide patterns a spliceosome may not recognize (see Figure 1 of [44] and Figure 2 of [32] for a variety of examples of known 5’ splice site motifs).
The choice of TAA for the branch point motif is motivated by the fact that most branch point sequences in intron-poor species studied to date contain TAA rather than the more general TRA (see Table 1 of [31]). To investigate possible avoidance of eukaryotic minor spliceosomal intron splice site motifs, we first used GTNNCC / TAA / CAG as well as ATNNCC / TTAA / CAC, making use of the greater conservation of the motifs within eukaryotes. Since many unicellular eukaryotes harbour short to ultrasmall introns, we specified intron lengths from 17 to 52nt.

In order to see whether the results reported for Thermotogae could in fact be representative of other prokaryotic lineages, we also applied our method of analysis to the domains Archaea and Bacteria, each as a whole, using the same seed motifs and intron length range as for the Thermotogae. The principal restriction here is in the number of chromosomes (90 for Archaea and 1177 for Bacteria), which increases the computer time required for the analysis.

We have also applied our method of analysis to complete bacterial genomes of the intracellular pathogens of eukaryotes Legionella pneumophila, Legionella longbeachae [48] and Coxiella burnetii, the last an obligate intracellular acidophile. All of these are reported to have a number of genes similar to eukaryotic genes, otherwise rare in Bacteria. Eleven genomes (5 Legionella pneumophila strains, 1 Legionella longbeachae strain and 5 Coxiella burnetii strains) are the only complete genomes in the γ-proteobacterial order Legionellales in RefSeq [51] Release 44. We used all of them.

Results and Discussion

The Guillardia theta nucleomorph genome showed evidence of underrepresentation for the motifs GTNNGT, TAA and TAG (see Table 1), consistent with avoidance of an active spliceosome using these motifs, which we know to be active. In contrast to this, there is no evidence of underrepresentation for the motifs GTNNGT, TAA and CAG.
This would imply that the nucleomorph spliceosome does not splice introns with these motifs, which is a testable prediction. The list of introns included in Table 2 of the Supplementary Information of the Guillardia theta nucleomorph genome paper [39] does however include an intron with just these motifs in the gene identified as Ψrpl24. Chromosome 2 of the Guillardia theta nucleomorph contains this intron sequence in its entirety, within the pseudogene for gene rpl24 with the locus tag GTHECHR2056. The same chromosome also encodes another gene rpl24 with locus tag GTHECHR2057, to which the automatic annotation pipeline has assigned the protein id XP_001713328.1. A cDNA sequence (GenBank: EG716478.1) extracted from Guillardia theta cells (from total RNA) contains this intron sequence, except for the two initial nucleotides GT of the 5’ splice site, still joined to the remainder of the rpl24 pseudogene (i.e. not spliced) and also most of the tRNA-Arg gene with locus tag GTHECHR2t102. We took what would have been the coding sequence of the pseudogene, removed the putative intron, and searched for an EST bridging the putative intron site, but without success. We interpret all this to mean that the only intron listed in the Supplementary Information of the Guillardia theta nucleomorph genome paper with a 3’ splice site motif of CAG is not an intron that is spliced by the Guillardia theta nucleomorph spliceosome. Thus, we see no available evidence to contradict the claim that the Guillardia theta nucleomorph spliceosome does not splice short introns with the motifs GTNNGT, TAA and CAG. We see no significant signal on the noncoding strand.
The lack of significance in this case, most easily seen by noting (in Table 1) that two of the three chromosomes showed neither under- nor overrepresentation of the set of motifs, indicates a lack of underrepresentation of intron-like sequences on the non-coding strands of Guillardia theta nucleomorph genes, consistent with our tentative interpretation of the observed underrepresentation being a result of hypothetical selection against splicing only, and therefore only applicable to mRNA.

The Hemiselmis andersenii nucleomorph data mirror those for Guillardia theta, and this is important because it strongly suggests that genomes which have lost their spliceosome can carry a trace of its prior activity. As for Guillardia theta, we find no significant underrepresentation for the motifs GTNNGT, TAA and CAG. For the motifs GTNNGT, TAA and TAG, we do see significant underrepresentation, once again only on the coding strand (see Table 2). The fact that the observed underrepresentation is weaker for Hemiselmis andersenii than for Guillardia theta is consistent with a scenario involving a slowly decaying signal following spliceosome loss.

In the case of Thermotogae, we see no significant underrepresentation using the motifs GTNNGT, TAA and TAG. This is reminiscent of the observation that the eukaryote Trichomonas vaginalis does not splice introns with a TAG 3’ splice site motif, instead sharing a long and required ACTAACACACAG 3’ splice site motif with at least one Giardia intestinalis intron. Using the motifs GTNNGT, TAA and CAG with Thermotogae genomes, we do see significant underrepresentation on the coding strand but not on the noncoding strand (Table 3).
We investigated whether the signal would survive if we specified the branch point sequence more completely, and find that the motifs GTNNGT, CTAA and CAG are also significantly underrepresented on the coding strand, in spite of the expected decrease in the raw numbers of matches (which would typically lead to poorer statistics). We also investigated the effect of changing the nucleotide upstream of the branch point adenosine, since this can vary in eukaryotes, but only find significance using the TAA motif. We also (Table 3) observed significant underrepresentation for splicing motifs usually associated with the minor spliceosome: GTNNCC, TAA and CAG, and so investigated further, to see whether non-canonical AT-AC termini were also avoided. We did not detect any significant avoidance for such motif triplets, suggesting that a putative Thermotogae spliceosome, if it ever existed, may have been a major spliceosome of a permissive type. These results are useful in that they indicate that our method could in principle detect splicing signals that differ from the ones we see today in intron-poor eukaryotes.

Corresponding results for Archaea are presented in Table 4. One difference is that we now also observe avoidance of the motifs GTNNGT, TGA and CAG, which is more consistent with the variation observed in intron-poor eukaryotes, in particular the common branch point consensus sequence TRA.

In the case of Bacteria, our results are presented in Table 5 and Figure 1. The large number (1177) of chromosomes restricted our analysis, but strong trends can still be discerned. In addition to the types of avoidance seen in Thermotogae and Archaea, we now also see avoidance of non-canonical AT-AC termini of a clear minor spliceosomal type, corresponding to the motifs ATNNCC, TTAA and YAC. This is potentially interesting, since it has been supposed that major spliceosomal introns are ancestral to minor spliceosomal introns.
In the bacterial data set, we also detect avoidance of the group II intron-like splice site motifs GTGNG, CTA [55] and AT, which is interesting from an evolutionary point of view, even though these introns are not spliceosomal. Given that we are examining only short intron-like sequences, this signal could conceivably be due to a mechanism of ORF-less group II intron repression in bacteria, but studying this question would be beyond the scope of this paper.

To determine whether our assignment of equal weight to each gene introduces an unwelcome bias into our analysis, we have recomputed the analyses for the motif triples GTNNGT / TAA / CAG and GTNNGT / TAA / TAG with the full bacterial data set, but this time weighting genes according to their coding sequence length in nucleotides. Apparent significant avoidance of these two triples is central to the conclusions of this paper. Our raw results for the former triple are 520 chromosomes showing avoidance of the given order and 657 chromosomes showing a bias for the given order. 2000 shufflings resulted in a one-sided P-value estimate of 0.0005, which is consistent with the earlier estimate (Table 5), based upon equal gene weighting, of <0.001. In the case of the latter triple, we find the chromosome numbers 478 and 690, respectively, and estimate the one-sided P-value to be <0.01. This is also consistent with the estimate based upon equal gene weighting. We conclude that our results are robust with respect to the choice of weighting.

Figure 1 is important because it demonstrates that the use of motif triplet ordering does provide what one would call a stable phylogenetic marker. First, when using the full bacterial data set and shuffling 20,000 times, the distribution of the ratio (of the number of chromosomes showing avoidance to the number of chromosomes showing either avoidance or bias for the given order of the motifs GTNNGT, TAA and CAG) was centred at 50%. This is the blue peak in Figure 1.
One would hope to see this for a large and highly diverse genomic data set. We then took the entire bacterial data set again, randomly reassigning all codons (excepting the start and stop codons) in all genes on all chromosomes 4000 times, performing our analysis anew each time. The distribution of these randomly reassigned genomic data sets is the green peak in Figure 1, which contains the raw genomic result (black vertical line), and remains separate from the blue peak. Our conclusion is that random reassignment of codons, which changes GC content as a side effect, does not destroy the signal we observe.

Horizontal transfer of intronless protein coding genes from eukaryotes to prokaryotes [26,58] could in principle explain our observations. In order to understand whether this explanation is likely to be true, we also analysed the eleven available completed bacterial genomes of intracellular pathogens of eukaryotes in the proteobacterial order Legionellales. Legionella pneumophila is known to harbour a relatively large number of eukaryote-like proteins. A recent, careful study [50] identified 14 eukaryotic-like proteins as having been definitely acquired from eukaryotes, making up less than 1% of the genome, out of a total of 102 eukaryotic-like proteins, and suggested that gene acquisition from eukaryotes is an ongoing process. If our observations were due to horizontal transfer from eukaryotes to prokaryotes, then these intracellular pathogens of eukaryotes would be expected to show a particularly clear signal of avoidance. Our data, presented in Table 6, however, do not show significant avoidance at all. Thus, horizontal transfer of genes from eukaryotes to prokaryotes appears unlikely to explain the avoidance we observe in Archaea and Bacteria, each treated as a whole.
We have observed that Thermotogae exhibit significant avoidance of intron-like sequences on the coding strand of protein coding genes, but Legionellales do not. It is tempting to interpret this as indicating that Thermotogae would have lost introns more recently than Legionellales, but we do not draw any such conclusion here since thermophiles do appear to have especially low mutation rates in comparison with mesophiles, and may thus retain ancient signals more robustly, whereas intracellular pathogens tend to be subject to unusually intense genome reduction, and this may contribute to the erasure of ancient signals.

It is on the basis of all these data that we suggest we may have found evidence in favour (not proof) of the hypothesis of prior exposure to an active spliceosome in Thermotogae, archaeal and bacterial genomes. Note that we have used all available archaeal and bacterial genomes in our analyses of Archaea and Bacteria, respectively, making no attempt to correct for sampling bias. For this reason, we do not claim to have provided evidence here for every archaeon or every bacterium on the level of individual isolates or even phyla (except for the Thermotogae). Rather, we have observed a general, but significant, trend for each domain separately. Although one cannot assume that putative prokaryotic splicing signals would necessarily be similar to the ones we see today in intron-poor eukaryotes, our data do suggest a degree of similarity.

Our proposition, that we may have evidence consistent with (but not proving) prokaryotic intron loss, could be falsified in at least one way: Any prokaryotic mRNA binding factor using the same, or very similar, motifs to a eukaryotic spliceosome might explain our results without needing to invoke the prior existence of active spliceosomes in prokaryotes. We have conducted a literature search, but are not aware of any viable candidate.
The basic ideas upon which our work is based, (i) that intronless genes must avoid attracting the attention of the spliceosome, and (ii) that this should be reflected in an underrepresentation of intron-like sequences in intronless coding genes, can be tested. On the basis of our analysis of the Guillardia theta nucleomorph genome, we predict that the Guillardia theta nucleomorph spliceosome is unable to splice introns with a CAG 3’ splice site sequence.

Materials and Methods

Complete genomic sequences were downloaded from RefSeq [51] Release 44. Given three ordered motifs, limits on the lengths of matched sequences, and a set of complete genomes or chromosomes, we perform our analysis in the following way. For practical reasons, we pad shorter motifs with the symbol “N” [61] until all motifs have the same length. Thus, the triplet GTNNGT / TAA / TAG becomes GTNNGT / TAANNN / TAGNNN. For each genome or chromosome, we perform the following operations on every contiguous (intronless) protein coding sequence. In each coding sequence, excepting the start and stop codons, we count the number of matches of the motifs in the given order, and also in the reverse order. We require at least a one nucleotide gap between motifs, and a total length, including the padding, between the specified limits. For example, the sequence ACTAGTACAGTAACGGTGTAAGTAGATAACTTTTTAGGACT contains exactly one match for each ordering. Each coding sequence is then assigned the value +1, -1 or 0, depending upon whether there were more matches to the given order, more matches to the reverse order, or equal numbers of matches to both orders, respectively. For each genome or chromosome, we sum these values for all contiguous protein coding sequences, giving all genes analyzed equal weight.
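The padding and counting steps above can be illustrated with a short Python sketch. It is not the author's code, and several details are assumptions made here because the text leaves them open: the one-nucleotide gap is measured between the padded motifs, the total length runs from the first nucleotide of the first motif to the last padded position of the third, a padded motif must fit entirely inside the sequence, and all function names are hypothetical.

```python
# IUPAC codes used by the motifs in this paper (N = any, R = purine, Y = pyrimidine).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT", "R": "AG", "Y": "CT"}


def pad_motifs(motifs):
    """Right-pad shorter motifs with N so that all motifs share one length."""
    width = max(len(m) for m in motifs)
    return [m + "N" * (width - len(m)) for m in motifs]


def match_positions(seq, motif):
    """Start positions at which the (padded) motif matches the sequence."""
    w = len(motif)
    return [i for i in range(len(seq) - w + 1)
            if all(seq[i + k] in IUPAC[c] for k, c in enumerate(motif))]


def count_ordered(seq, motifs, min_len, max_len):
    """Count occurrences of the three motifs in the order supplied."""
    padded = pad_motifs(motifs)
    w = len(padded[0])
    first, second, third = (match_positions(seq, m) for m in padded)
    count = 0
    for i in first:
        for j in second:
            if j < i + w + 1:          # assumed: at least 1 nt gap after padding
                continue
            for k in third:
                if k < j + w + 1:
                    continue
                if min_len <= k + w - i <= max_len:
                    count += 1
    return count


def gene_score(seq, motifs, min_len, max_len):
    """+1, -1 or 0: more given-order, more reverse-order, or equal matches."""
    given = count_ordered(seq, motifs, min_len, max_len)
    reverse = count_ordered(seq, motifs[::-1], min_len, max_len)
    return (given > reverse) - (given < reverse)
```

Under these assumptions, and using the triplet GTNNGT / TAA / TAG with length limits of 17 to 52nt, the example sequence quoted above yields one match in each order, in agreement with the text; a real analysis would first strip the start and stop codons from each coding sequence.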
The genome or chromosome is then assigned a value of +1, -1 or 0, depending upon whether there were more contiguous protein coding sequences with the value +1, more contiguous protein coding sequences with the value -1, or equal numbers of contiguous protein coding sequences with the values +1 and -1, respectively. The type of underrepresentation we are interested in expresses itself in there being an underrepresentation of genomes or chromosomes with the value +1 as compared to those with the value -1. Let G (for “given” order) denote the number of genomes or chromosomes with value +1, and R (for “reverse” order) the number of genomes or chromosomes with the value -1. In Tables 1 to 6, G is provided in the second column, while R is provided in the third column.

To evaluate the significance of any underrepresentation, we compute a one-sided P-value in the following manner, the purpose of which is to compensate for gradients in codon use along protein coding sequences. We perform the same analysis as described above for 20,000 (unless otherwise specified) independently generated shuffled copies of the given genomes or chromosomes, shuffling within each contiguous protein coding sequence as follows: We divide up the subsequence between the start and stop codons into windows of 9nt width, and randomly permute the nucleotides in each window, not allowing in-frame stop codons to be created in the process. In each case, let G’ and R’ denote the counts corresponding to G and R as described above, but for the shuffled sequence data. We approximate the one-sided P-value by the ratio LE/(LE+GT), where LE is the number of times, out of the 20,000 shufflings, that G’/(G’+R’) is less than or equal to G/(G+R), and GT is the number of times it is greater than G/(G+R). Cases in which G+R=0 and/or G’+R’=0 did not occur.
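The windowed shuffle and the P-value estimate can be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: stop codons are taken to be TAA, TAG and TGA, stop codons are avoided by simple rejection sampling with a fallback to the unshuffled window (the text does not say how the constraint was enforced), and the function names are hypothetical.

```python
import random

STOPS = {"TAA", "TAG", "TGA"}  # assumed: standard stop codons


def has_inframe_stop(window):
    """True if any complete in-frame codon of the window is a stop codon."""
    return any(window[i:i + 3] in STOPS for i in range(0, len(window) - 2, 3))


def shuffle_window(window, rng, max_tries=1000):
    """Permute nucleotides in one window, rejecting in-frame stop codons."""
    bases = list(window)
    for _ in range(max_tries):
        rng.shuffle(bases)
        candidate = "".join(bases)
        if not has_inframe_stop(candidate):
            return candidate
    return window  # fallback (assumption): keep the original window


def shuffle_cds(cds, rng, width=9):
    """Shuffle the region between start and stop codons in 9 nt windows."""
    start, body, stop = cds[:3], cds[3:-3], cds[-3:]
    pieces = [shuffle_window(body[i:i + width], rng)
              for i in range(0, len(body), width)]
    return start + "".join(pieces) + stop


def one_sided_p(g, r, shuffled_counts):
    """LE / (LE + GT), where shuffled_counts is a list of (G', R') pairs.

    Assumes g + r > 0 and G' + R' > 0 for every shuffled data set.
    """
    observed = g / (g + r)
    le = sum(1 for gs, rs in shuffled_counts if gs / (gs + rs) <= observed)
    gt = len(shuffled_counts) - le
    return le / (le + gt)
```

Because the windows are 9nt wide and start directly after the start codon, each window holds whole codons, so checking for stop codons window by window is equivalent to checking the whole coding sequence in frame.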
Note that small values of this one-sided P-value do mean that there is significant underrepresentation of the motifs in the given order, but large values do not automatically mean that there is significant overrepresentation. We have used P<2.5% as our threshold for significance throughout. We look for underrepresentation on the noncoding strand by using motifs which are the inverse complements of those used for the coding strand, requiring care with the concepts of given and reverse order. No significant under- or overrepresentations were actually observed on the noncoding strand.

Acknowledgements

I would like to thank Dr. Takayuki Naito of the Molecular Neuroscience Unit at OIST, Japan, and Prof. Byrappa Venkatesh of the Molecular Genetics Lab, Institute of Molecular and Cell Biology, Singapore, both of whom have commented on results and interpretations at various stages in this long-term project, giving valuable constructive criticism and advice. This manuscript is an expanded version of a poster of the same name presented at the Annual Meeting of the Society for Molecular Biology and Evolution in Lyon, France, in July 2010, in which the method was described and applied to the cryptophyte nucleomorph, Thermotogae, Clostridium and Crenarchaeota genomes (from RefSeq 41).

Figure 1: Robustness of phylogenetic signal. The black vertical line represents the percentage of 1177 bacterial chromosomes which show avoidance of the motifs GTNNGT, TAA and CAG, in the given order, with respect to the reverse ordering. The green peak was obtained by randomly reassigning all codons in all genes 4000 times, modeling the effects of diverse selective forces on codon bias and GC content. The blue peak was obtained by randomly shuffling nucleotides within 9-base-pair windows in all genes 20,000 times, and therefore represents a null model of non-informative gene sequences.
What can be seen is that the codon reassignment peak not only contains the actual data, but remains separate from the null model. We conclude that the phylogenetic signal is likely to be robust with respect to the effects of selective forces on codon bias and GC content.

Table 1: Analysis of Guillardia theta nucleomorph chromosomes.

| Motifs (Given Order) | Chromosomes Given Order | Chromosomes Reverse Order | One-Sided P-Value |
|---|---|---|---|
| GTNNGT / TAA / TAG | 1 | 2 | 0.003 * |
| ACNNAC / TTA / CTA (inv. cpl.) | 1 | 0 | 1 |
| GTNNGT / TAA / CAG | 3 | 0 | 1 |
| GTNNCT / AGA / GAT | 2 | 1 | 0.177 |

This nucleomorph possesses an active spliceosome. Splice site motifs GTNNGT, TAA and TAG are avoided in coding sequences on the coding strand. On the non-coding strand we see neither significant under- nor overrepresentation, as evidenced by the fact that two out of three chromosomes contained equally many genes with bias for or against the given order of inverse complemented motifs. Splice site motifs GTNNGT, TAA and CAG are not avoided in coding sequences. The high one-sided P-value is due to 74% of all shuffled genomic datasets sharing the counts of 3 and 0 chromosomes showing a bias towards the given or reverse order, respectively. Thus, we observe no under- or overrepresentation for the motifs GTNNGT, TAA and CAG. The motifs GTNNCT / AGA / GAT are intended to be a control.

Table 2: Analysis of Hemiselmis andersenii nucleomorph chromosomes.

| Motifs (Given Order) | Chromosomes Given Order | Chromosomes Reverse Order | One-Sided P-Value |
|---|---|---|---|
| GTNNGT / TAA / TAG | 1 | 1 | 0.013 * |
| ACNNAC / TTA / CTA (inv. cpl.) | 2 | 1 | 0.569 |
| GTNNGT / TAA / CAG | 2 | 1 | 0.246 |
| GTNNCT / AGA / GAT | 2 | 1 | 0.247 |

This is a eukaryotic genome possessing neither spliceosomal introns nor a spliceosome. We observe avoidance of the splice site motifs GTNNGT, TAA and TAG on the coding strand only, consistent with a memory of intron loss.

Table 3: Analysis of 11 Thermotogae genomes.

| Motifs (Given Order) | Genomes Given Order | Genomes Reverse Order | One-Sided P-Value (n) |
|---|---|---|---|
| GTNNGT / TAA / CAG | 1 | 9 | 0.005 * |
| ACNNAC / TTA / CTG (inv. cpl.) | 5 | 6 | 0.499 |
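The noncoding-strand rows in the tables use the inverse complements of the coding-strand motifs (for example, GTNNGT / TAA / TAG becomes ACNNAC / TTA / CTA). The following is a minimal sketch of that transformation only. The function names are hypothetical, and the order in which the three inverse-complemented motifs should then be searched is exactly the point of care the text mentions; this sketch does not settle it.

```python
# Complement table for DNA bases; N (any base) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def inverse_complement(motif: str) -> str:
    """Complement every base of the motif, then reverse it."""
    return motif.upper().translate(COMPLEMENT)[::-1]


def noncoding_motifs(motifs):
    """Inverse-complement each motif, keeping the listing order used in the tables.

    Assumption: only the motif strings are transformed here; how "given" and
    "reverse" order are defined on the noncoding strand is left to the caller.
    """
    return [inverse_complement(m) for m in motifs]


if __name__ == "__main__":
    print(" / ".join(noncoding_motifs(["GTNNGT", "TAA", "TAG"])))
    print(" / ".join(noncoding_motifs(["GTNNGT", "TAA", "CAG"])))
```

Applied to the motif triplets from Tables 1 to 3, this reproduces the inverse-complement rows listed there.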

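The null model in the Figure 1 caption shuffles nucleotides within 9-base-pair windows of each gene. A minimal sketch of one such shuffle is given below. It assumes non-overlapping windows that start at the first base of the gene, with a shorter final window shuffled on its own; the source does not specify these details, and the function name and parameters are hypothetical.

```python
import random


def shuffle_within_windows(seq: str, window: int = 9, rng=None) -> str:
    """Shuffle the bases inside each consecutive window of the sequence.

    Assumptions (not stated in the source): windows do not overlap, start at
    position 0, and a trailing partial window is shuffled separately.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    rng = rng or random.Random()
    pieces = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)  # in-place shuffle keeps the window's base composition
        pieces.append("".join(chunk))
    return "".join(pieces)


if __name__ == "__main__":
    gene = "ATGGTAAGTCTAACAGGCTTAA"
    print(shuffle_within_windows(gene, rng=random.Random(0)))
```

Because bases never leave their window, local composition is preserved while any ordering of motifs across longer distances is retained only by chance, which is what makes this kind of shuffle usable as a non-informative baseline.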
Publication date: 2011
